{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "059756b7",
   "metadata": {},
   "source": [
    "# CSV Loader\n",
    "\n",
    "- Author: [JoonHo Kim](https://github.com/jhboyo)\n",
    "- Peer Review : [syshin0116](https://github.com/syshin0116), [forwardyoung](https://github.com/forwardyoung)\n",
    "- Proofread : [Q0211](https://github.com/Q0211)\n",
    "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
    "\n",
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/04-CSVLoader.ipynb)\n",
    "[![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/04-CSVLoader.ipynb)\n",
    "\n",
    "\n",
    "## Overview\n",
    "\n",
    "This tutorial provides a comprehensive guide on how to use the ```CSVLoader``` utility in LangChain to seamlessly integrate data from CSV files into your applications. The ```CSVLoader``` is a powerful tool for processing structured data, enabling developers to extract, parse, and utilize information from CSV files within the LangChain framework.\n",
    "\n",
    "[Comma-Separated Values (CSV)](https://en.wikipedia.org/wiki/Comma-separated_values) is one of the most common formats for storing and exchanging data.\n",
    "\n",
    "```CSVLoader``` simplifies the process of loading, parsing, and extracting data from CSV files, allowing developers to seamlessly incorporate this information into LangChain workflows.\n",
    "\n",
    "\n",
    "### Table of Contents\n",
    "\n",
    "- [Overview](#overview)\n",
    "- [Environment Setup](#environment-setup)\n",
    "- [How to load CSVs](#how-to-load-csvs)\n",
    "- [Customizing the CSV parsing and loading](#customizing-the-csv-parsing-and-loading)\n",
    "- [Specify a column to identify the document source](#specify-a-column-to-identify-the-document-source)\n",
    "- [Generating XML document format](#generating-xml-document-format)\n",
    "- [UnstructuredCSVLoader](#unstructuredcsvloader)\n",
    "- [DataFrameLoader](#dataframeloader)\n",
    "\n",
    "\n",
    "### References\n",
    "\n",
    "- [Langchain CSVLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.csv_loader.CSVLoader.html)\n",
    "- [Langchain How to load CSVs](https://python.langchain.com/docs/how_to/document_loader_csv)\n",
    "- [Langchain DataFrameLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.dataframe.DataFrameLoader.html#dataframeloader)\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "923a83b4",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "\n",
    "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
    "\n",
    "**[Note]**\n",
    "- ```langchain-opentutorial``` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n",
    "- You can check out the [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.\n",
    "- ```unstructured``` package is a Python library for extracting text and metadata from various document formats like PDF and CSV\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "8e8ad1ef",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture --no-stderr\n",
    "%pip install langchain-opentutorial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "ea1677a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install required packages\n",
    "from langchain_opentutorial import package\n",
    "\n",
    "package.install(\n",
    "    [\n",
    "        \"langchain_community\",\n",
    "        \"unstructured\"\n",
    "    ],\n",
    "    verbose=False,\n",
    "    upgrade=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "c798a6f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set environment variables\n",
    "from langchain_opentutorial import set_env\n",
    "from dotenv import load_dotenv\n",
    "\n",
    "if not load_dotenv():\n",
    "    set_env(\n",
    "        {\n",
    "            \"OPENAI_API_KEY\": \"\",\n",
    "            \"LANGCHAIN_API_KEY\": \"\",\n",
    "            \"LANGCHAIN_TRACING_V2\": \"true\",\n",
    "            \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n",
    "            \"LANGCHAIN_PROJECT\": \"04-CSV-Loader\",\n",
    "        }\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d8e843e",
   "metadata": {},
   "source": [
    "You can alternatively set ```OPENAI_API_KEY``` in ```.env``` file and load it. \n",
    "\n",
    "[Note] This is not necessary if you've already set ```OPENAI_API_KEY``` in previous steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "433c7da6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from dotenv import load_dotenv\n",
    "\n",
    "load_dotenv()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df03b400",
   "metadata": {},
   "source": [
    "## How to load CSVs\n",
    "\n",
    "A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. LangChain can help you load CSV files easily—just import ```CSVLoader``` to get started. \n",
    "\n",
    "Each line of the file is a data record, and each record consists of one or more fields, separated by commas. \n",
    "\n",
    "We use a sample CSV file for the example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "ea60830d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "page_content='PassengerId: 1\n",
      "Survived: 0\n",
      "Pclass: 3\n",
      "Name: Braund, Mr. Owen Harris\n",
      "Sex: male\n",
      "Age: 22\n",
      "SibSp: 1\n",
      "Parch: 0\n",
      "Ticket: A/5 21171\n",
      "Fare: 7.25\n",
      "Cabin: \n",
      "Embarked: S' metadata={'source': './data/titanic.csv', 'row': 0}\n",
      "page_content='PassengerId: 2\n",
      "Survived: 1\n",
      "Pclass: 1\n",
      "Name: Cumings, Mrs. John Bradley (Florence Briggs Thayer)\n",
      "Sex: female\n",
      "Age: 38\n",
      "SibSp: 1\n",
      "Parch: 0\n",
      "Ticket: PC 17599\n",
      "Fare: 71.2833\n",
      "Cabin: C85\n",
      "Embarked: C' metadata={'source': './data/titanic.csv', 'row': 1}\n"
     ]
    }
   ],
   "source": [
    "from langchain_community.document_loaders.csv_loader import CSVLoader\n",
    "\n",
    "# Create CSVLoader instance\n",
    "loader = CSVLoader(file_path=\"./data/titanic.csv\")\n",
    "\n",
    "# Load documents\n",
    "docs = loader.load()\n",
    "\n",
    "for record in docs[:2]:\n",
    "    print(record)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "358f1c4c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "PassengerId: 2\n",
      "Survived: 1\n",
      "Pclass: 1\n",
      "Name: Cumings, Mrs. John Bradley (Florence Briggs Thayer)\n",
      "Sex: female\n",
      "Age: 38\n",
      "SibSp: 1\n",
      "Parch: 0\n",
      "Ticket: PC 17599\n",
      "Fare: 71.2833\n",
      "Cabin: C85\n",
      "Embarked: C\n"
     ]
    }
   ],
   "source": [
    "print(docs[1].page_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2136512",
   "metadata": {},
   "source": [
    "## Customizing the CSV parsing and loading\n",
    "\n",
    "```CSVLoader``` accepts a ```csv_args``` keyword argument that supports customization of the parameters passed to Python's ```csv.DictReader```. This allows you to handle various CSV formats, such as custom delimiters, quote characters, or specific newline handling. \n",
    "\n",
    "See Python's [csv module](https://docs.python.org/3/library/csv.html) documentation for more information on supported ```csv_args``` and how to tailor the parsing to your specific needs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "a4b1fc3a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Passenger ID: 1\n",
      "Survival (1: Survived, 0: Died): 0\n",
      "Passenger Class: 3\n",
      "Name: Braund, Mr. Owen Harris\n",
      "Sex: male\n",
      "Age: 22\n",
      "Number of Siblings/Spouses Aboard: 1\n",
      "Number of Parents/Children Aboard: 0\n",
      "Ticket Number: A/5 21171\n",
      "Fare: 7.25\n",
      "Cabin: \n",
      "Port of Embarkation: S\n"
     ]
    }
   ],
   "source": [
    "loader = CSVLoader(\n",
    "    file_path=\"./data/titanic.csv\",\n",
    "    csv_args={\n",
    "        \"delimiter\": \",\",\n",
    "        \"quotechar\": '\"',\n",
    "        \"fieldnames\": [\n",
    "            \"Passenger ID\",\n",
    "            \"Survival (1: Survived, 0: Died)\",\n",
    "            \"Passenger Class\",\n",
    "            \"Name\",\n",
    "            \"Sex\",\n",
    "            \"Age\",\n",
    "            \"Number of Siblings/Spouses Aboard\",\n",
    "            \"Number of Parents/Children Aboard\",\n",
    "            \"Ticket Number\",\n",
    "            \"Fare\",\n",
    "            \"Cabin\",\n",
    "            \"Port of Embarkation\",\n",
    "        ],\n",
    "    },\n",
    ")\n",
    "\n",
    "docs = loader.load()\n",
    "\n",
    "print(docs[1].page_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74acacb3",
   "metadata": {},
   "source": [
    "## Specify a column to identify the document source"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31c71c0d",
   "metadata": {},
   "source": [
    "You should use the ```source_column``` argument to specify the source of the documents generated from each row. Otherwise ```file_path``` will be used as the source for all documents created from the CSV file.\n",
    "\n",
    "This is particularly useful when using the documents loaded from a CSV file in a chain designed to answer questions based on their source."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "0b4f08a1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "page_content='PassengerId: 2\n",
      "Survived: 1\n",
      "Pclass: 1\n",
      "Name: Cumings, Mrs. John Bradley (Florence Briggs Thayer)\n",
      "Sex: female\n",
      "Age: 38\n",
      "SibSp: 1\n",
      "Parch: 0\n",
      "Ticket: PC 17599\n",
      "Fare: 71.2833\n",
      "Cabin: C85\n",
      "Embarked: C' metadata={'source': '2', 'row': 1}\n",
      "{'source': '2', 'row': 1}\n"
     ]
    }
   ],
   "source": [
    "loader = CSVLoader(\n",
    "    file_path=\"./data/titanic.csv\",\n",
    "    source_column=\"PassengerId\",  # Specify the source column\n",
    ")\n",
    "\n",
    "docs = loader.load()  \n",
    "\n",
    "print(docs[1])\n",
    "print(docs[1].metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3de64ab",
   "metadata": {},
   "source": [
    "## Generating XML document format\n",
    "\n",
    "This example shows how to generate XML Document format from ```CSVLoader```. By processing data from a CSV file, you can convert its rows and columns into a structured XML representation."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a246efd6",
   "metadata": {},
   "source": [
    "Convert a row in the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "106461e0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<row><PassengerId>2</PassengerId><Survived>1</Survived><Pclass>1</Pclass><Name>Cumings, Mrs. John Bradley (Florence Briggs Thayer)</Name><Sex>female</Sex><Age>38</Age><SibSp>1</SibSp><Parch>0</Parch><Ticket>PC 17599</Ticket><Fare>71.2833</Fare><Cabin>C85</Cabin><Embarked>C</Embarked></row>\n"
     ]
    }
   ],
   "source": [
    "row = docs[1].page_content.split(\"\\n\")  # split by new line\n",
    "row_str = \"<row>\"\n",
    "for element in row:\n",
    "    splitted_element = element.split(\":\")  # split by \":\"\n",
    "    value = splitted_element[-1]  # get value\n",
    "    col = \":\".join(splitted_element[:-1])  # get column name\n",
    "\n",
    "    row_str += f\"<{col}>{value.strip()}</{col}>\"\n",
    "row_str += \"</row>\"\n",
    "print(row_str)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb1381a0",
   "metadata": {},
   "source": [
    "Convert entire rows in the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "dca7126d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<row><PassengerId>2</PassengerId><Survived>1</Survived><Pclass>1</Pclass><Name>Cumings, Mrs. John Bradley (Florence Briggs Thayer)</Name><Sex>female</Sex><Age>38</Age><SibSp>1</SibSp><Parch>0</Parch><Ticket>PC 17599</Ticket><Fare>71.2833</Fare><Cabin>C85</Cabin><Embarked>C</Embarked></row>\n",
      "<row><PassengerId>3</PassengerId><Survived>1</Survived><Pclass>3</Pclass><Name>Heikkinen, Miss. Laina</Name><Sex>female</Sex><Age>26</Age><SibSp>0</SibSp><Parch>0</Parch><Ticket>STON/O2. 3101282</Ticket><Fare>7.925</Fare><Cabin></Cabin><Embarked>S</Embarked></row>\n",
      "<row><PassengerId>4</PassengerId><Survived>1</Survived><Pclass>1</Pclass><Name>Futrelle, Mrs. Jacques Heath (Lily May Peel)</Name><Sex>female</Sex><Age>35</Age><SibSp>1</SibSp><Parch>0</Parch><Ticket>113803</Ticket><Fare>53.1</Fare><Cabin>C123</Cabin><Embarked>S</Embarked></row>\n",
      "<row><PassengerId>5</PassengerId><Survived>0</Survived><Pclass>3</Pclass><Name>Allen, Mr. William Henry</Name><Sex>male</Sex><Age>35</Age><SibSp>0</SibSp><Parch>0</Parch><Ticket>373450</Ticket><Fare>8.05</Fare><Cabin></Cabin><Embarked>S</Embarked></row>\n",
      "<row><PassengerId>6</PassengerId><Survived>0</Survived><Pclass>3</Pclass><Name>Moran, Mr. James</Name><Sex>male</Sex><Age></Age><SibSp>0</SibSp><Parch>0</Parch><Ticket>330877</Ticket><Fare>8.4583</Fare><Cabin></Cabin><Embarked>Q</Embarked></row>\n"
     ]
    }
   ],
   "source": [
    "for doc in docs[1:6]:  # skip header\n",
    "    row = doc.page_content.split(\"\\n\")\n",
    "    row_str = \"<row>\"\n",
    "    for element in row:\n",
    "        splitted_element = element.split(\":\")  # split by \":\"\n",
    "        value = splitted_element[-1]  # get value\n",
    "        col = \":\".join(splitted_element[:-1])  # get column name\n",
    "        row_str += f\"<{col}>{value.strip()}</{col}>\"\n",
    "    row_str += \"</row>\"\n",
    "    print(row_str)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de2aeefb",
   "metadata": {},
   "source": [
    "## UnstructuredCSVLoader "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b3cfc80",
   "metadata": {},
   "source": [
    "```UnstructuredCSVLoader``` can be used in both ```single``` and ```elements``` mode. If you use the loader in “elements” mode, the CSV file will be a single Unstructured Table element. If you use the loader in elements” mode, an HTML representation of the table will be available in the ```text_as_html``` key in the document metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "31471f0f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<table><tr><td>PassengerId</td><td>Survived</td><td>Pclass</td><td>Name</td><td>Sex</td><td>Age</td><td>SibSp</td><td>Parch</td><td>Ticket</td><td>Fare</td><td>Cabin</td><td>Embarked</td></tr><tr><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.25</td><td/><td>S</td></tr><tr><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Thayer)</td><td>female</td><td>38</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr><tr><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.925</td><td/><td>S</td></tr><tr><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n"
     ]
    }
   ],
   "source": [
    "from langchain_community.document_loaders.csv_loader import UnstructuredCSVLoader\n",
    "\n",
    "# Generate UnstructuredCSVLoader instance with elements mode\n",
    "loader = UnstructuredCSVLoader(file_path=\"./data/titanic.csv\", mode=\"elements\")\n",
    "\n",
    "docs = loader.load()\n",
    "\n",
    "html_content = docs[0].metadata[\"text_as_html\"]\n",
    "\n",
    "# Partial output due to space constraints\n",
    "print(html_content[:810]) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31e0af6a",
   "metadata": {},
   "source": [
    "## DataFrameLoader\n",
    "\n",
    "```Pandas``` is an open-source data analysis and manipulation tool for the Python programming language. This library is widely used in data science, machine learning, and various fields for working with data.\n",
    "\n",
    "LangChain's ```DataFrameLoader``` is a powerful utility designed to seamlessly integrate ```Pandas```  ```DataFrames``` into LangChain workflows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "2d6104f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(\"./data/titanic.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a03d9f7c",
   "metadata": {},
   "source": [
    "Search the first 5 rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "68610091",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Ticket</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Cabin</th>\n",
       "      <th>Embarked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Braund, Mr. Owen Harris</td>\n",
       "      <td>male</td>\n",
       "      <td>22.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>A/5 21171</td>\n",
       "      <td>7.2500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
       "      <td>female</td>\n",
       "      <td>38.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>PC 17599</td>\n",
       "      <td>71.2833</td>\n",
       "      <td>C85</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>Heikkinen, Miss. Laina</td>\n",
       "      <td>female</td>\n",
       "      <td>26.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>STON/O2. 3101282</td>\n",
       "      <td>7.9250</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
       "      <td>female</td>\n",
       "      <td>35.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>113803</td>\n",
       "      <td>53.1000</td>\n",
       "      <td>C123</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Allen, Mr. William Henry</td>\n",
       "      <td>male</td>\n",
       "      <td>35.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>373450</td>\n",
       "      <td>8.0500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   PassengerId  Survived  Pclass  \\\n",
       "0            1         0       3   \n",
       "1            2         1       1   \n",
       "2            3         1       3   \n",
       "3            4         1       1   \n",
       "4            5         0       3   \n",
       "\n",
       "                                                Name     Sex   Age  SibSp  \\\n",
       "0                            Braund, Mr. Owen Harris    male  22.0      1   \n",
       "1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   \n",
       "2                             Heikkinen, Miss. Laina  female  26.0      0   \n",
       "3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   \n",
       "4                           Allen, Mr. William Henry    male  35.0      0   \n",
       "\n",
       "   Parch            Ticket     Fare Cabin Embarked  \n",
       "0      0         A/5 21171   7.2500   NaN        S  \n",
       "1      0          PC 17599  71.2833   C85        C  \n",
       "2      0  STON/O2. 3101282   7.9250   NaN        S  \n",
       "3      0            113803  53.1000  C123        S  \n",
       "4      0            373450   8.0500   NaN        S  "
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(n=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ba09856",
   "metadata": {},
   "source": [
    "Parameters ```page_content_column``` (str) – Name of the column containing the page content. Defaults to “text”.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "6e3f205d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Braund, Mr. Owen Harris\n"
     ]
    }
   ],
   "source": [
    "from langchain_community.document_loaders import DataFrameLoader\n",
    "\n",
    "# The Name column of the DataFrame is specified to be used as the content of each document.\n",
    "loader = DataFrameLoader(df, page_content_column=\"Name\")\n",
    "\n",
    "docs = loader.load()\n",
    "\n",
    "print(docs[0].page_content)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17392338",
   "metadata": {},
   "source": [
    "```Lazy Loading``` for large tables. Avoid loading the entire table into memory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "e643b63a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "page_content='Braund, Mr. Owen Harris' metadata={'PassengerId': 1, 'Survived': 0, 'Pclass': 3, 'Sex': 'male', 'Age': 22.0, 'SibSp': 1, 'Parch': 0, 'Ticket': 'A/5 21171', 'Fare': 7.25, 'Cabin': nan, 'Embarked': 'S'}\n"
     ]
    }
   ],
   "source": [
    "# Lazy load records from dataframe.\n",
    "for row in loader.lazy_load():\n",
    "    print(row)\n",
    "    break  # print only the first row\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "py-test",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
